{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e2fff5dd",
   "metadata": {},
   "source": [
    "#  LangChain with Local Llama 2 Model\n",
    "\n",
    "This notebook uses the checkpoint from the [HuggingFace Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) model.\n",
    "\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ The notebook before this one, `07_Option(1)_NVIDIA_AI_endpoint_simple.ipynb`, contains the same exercise as this notebook but uses NVIDIA AI Catalog' models via API calls instead of loading the models' checkpoints pulled from huggingface model hub, and then load from host to devices (i.e GPUs).\n",
    "\n",
    "Noted that, since we will load the checkpoints, it will be significantly slower to go through this entire notebook. \n",
    "\n",
    "If you do decide to go through this notebook, please kindly check the **Prerequisite** section below.\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "## Prerequisite \n",
    "\n",
    "To run this notebook, you need the following:\n",
    "\n",
    "1. Prior approval to use the checkpoints by applying for access to the [meta-llama](https://huggingface.co/meta-llama) model.\n",
    "2. At least 2 NVIDIA GPUs, each with at least 32G mem, preferably using Ampere architecture.\n",
    "3. Installed Docker and [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit).\n",
    "4. Registered with [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) and can pull and run NGC PyTorch containers.\n",
    "5. Installed Python dependencies for this notebook, by overwriting langchain-core version: \n",
    "\n",
    "   If you are using the [Dockerfile.gpu_notebook](https://raw.githubusercontent.com/NVIDIA/GenerativeAIExamples/main/notebooks/Dockerfile.gpu_notebook), the dependency is already installed for you.\n",
    "\n",
    "The notebook will walk you through how to build an end-to-end RAG pipeline using [LangChain](https://python.langchain.com/docs/get_started/introduction), [faiss](https://python.langchain.com/docs/integrations/vectorstores/faiss) as the vectorstore and a custom llm of your choice from huggingface ( more specifically, we will be using [HuggingFace Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) in this notebook, but the process is similar for other llms from huggingface.\n",
    "\n",
    "\n",
    "Generically speaking, the RAG pipeline will involve 2 phases -\n",
    "\n",
    "The first one is the preprocessing phase illustrated below -"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e5f1b2e",
   "metadata": {},
   "source": [
    "![preprocessing](./imgs/preprocessing.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5171bf3",
   "metadata": {},
   "source": [
    "The second phase is the inference runtime -\n",
    "\n",
    "![inference_runtime](./imgs/inference_runtime.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be262e28",
   "metadata": {},
   "outputs": [],
   "source": [
    "# if you have only CPU, please use faiss-cpu instead\n",
    "#!pip install faiss-gpu accelerate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091cf88b",
   "metadata": {},
   "source": [
    "---\n",
    "Let's now go through this notebook step-by-step \n",
    "For the first phase, reminder of the flow \n",
    "![preprocessing](./imgs/preprocessing.png)\n",
    "\n",
    "### Step 1 - Load huggingface embedding "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd2ac192",
   "metadata": {},
   "outputs": [],
   "source": [
    "### load custom embedding and use it in Faiss\n",
    "from langchain.vectorstores import FAISS\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.chains import RetrievalQA\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.document_loaders import PyPDFLoader\n",
    "from langchain.document_loaders import DirectoryLoader\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "\n",
    "embedding_model_name = \"sentence-transformers/all-mpnet-base-v2\" # sentence-transformer is the most commonly used embedding\n",
    "emd_model_kwargs = {\"device\": \"cuda\"}\n",
    "hf_embedding = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=emd_model_kwargs)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8de16863",
   "metadata": {},
   "source": [
    "### Step 2 - Prepare the toy text dataset \n",
    "We will prepare the XXX.txt files ( there should be Sweden.txt and and using the above embedding to parse chuck of text and store them into faiss-gpu vectorstore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f34e898",
   "metadata": {},
   "source": [
    "Let's have a look at text datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2be8220d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "head -1 ./toy_data/Sweden.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed9673f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "head -3 ./toy_data/Titanic_film.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4e02a5d",
   "metadata": {},
   "source": [
    "### Step 3 -  Process the document into faiss vectorstore and save to disk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34030fba",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from tqdm import tqdm\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.vectorstores import FAISS\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from pathlib import Path\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "import faiss\n",
    "from langchain.vectorstores import FAISS,utils\n",
    "import pickle\n",
    "\n",
    "# Here we read in the text data and prepare them into vectorstore\n",
    "ps = list(Path(\"./toy_data/\").glob('**/*.txt'))\n",
    "print(ps)\n",
    "data = []\n",
    "sources = []\n",
    "for p in ps:\n",
    "    with open(p,encoding=\"utf-8\") as f:\n",
    "        data.append(f.read())\n",
    "    sources.append(p)\n",
    "\n",
    "# We do this due to the context limits of the LLMs.\n",
    "# Here we split the documents, as needed, into smaller chunks.\n",
    "# We do this due to the context limits of the LLMs.\n",
    "\n",
    "text_splitter = CharacterTextSplitter(chunk_size=200, separator=\"\\n\")\n",
    "docs = []\n",
    "metadatas = []\n",
    "for i, d in enumerate(data):\n",
    "    splits = text_splitter.split_text(d)\n",
    "    docs.extend(splits)\n",
    "    metadatas.extend([{\"source\": sources[i]}] * len(splits))\n",
    "\n",
    "# Here we create a vector store from the documents and save it to disk.\n",
    "store = FAISS.from_texts(docs, hf_embedding, metadatas=metadatas)\n",
    "\n",
    "store.save_local('./toy_data/nv_embedding')\n",
    "# you will only need to do this once, later on we will restore the already saved vectorstore\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37197d21",
   "metadata": {},
   "source": [
    "### Step 4 - Reload the already saved vectorstore and prepare for retrival"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89bfaf0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the LangChain.\n",
    "from pathlib import Path\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "import faiss\n",
    "from langchain.vectorstores import FAISS\n",
    "import pickle\n",
    "\n",
    "store = FAISS.load_local(\"./toy_data/nv_embedding\",hf_embedding )\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87c4422a",
   "metadata": {},
   "source": [
    "\n",
    "### Step 5 - Prepare the loaded vectorstore into a retriver "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b789e6ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = store.as_retriever(search_type='similarity', search_kwargs={\"k\": 3}) # k is a hyperparameter, usally by default set to 3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "798daed5",
   "metadata": {},
   "source": [
    "Now we are finally done with the preprocessing step, next we will proceed to phase 2\n",
    "\n",
    "--- \n",
    "Recall phase 2 involve a runtime which we could query the already loaded faiss vectorstore. \n",
    "\n",
    "![inference](./imgs/inference_runtime.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bdfa8b5",
   "metadata": {},
   "source": [
    "### Step 6 - Load the [HuggingFace Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) to your GPUs\n",
    "\n",
    "Note: Scroll down and make sure you supply the **hf_token in code block below [FILL_IN] your huggingface token**\n",
    ", for how to generate the token from huggingface, please following instruction from [this link](https://huggingface.co/docs/transformers.js/guides/private)\n",
    "\n",
    "Note: The execution of cell below will take up sometime, please be patient until the checkpoint is fully loaded. Alternatively, turn to previous notebook 07_Option(1)_NVIDIA_AI_endpoint_simply.ipynb if you wish to use already deployed models as API calls instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76d71c9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch\n",
    "import transformers\n",
    "from langchain import HuggingFacePipeline\n",
    "from transformers import (\n",
    "    AutoConfig,\n",
    "    AutoModel,\n",
    "    AutoModelForCausalLM,\n",
    "    AutoTokenizer,\n",
    "    GenerationConfig,\n",
    "    LlamaForCausalLM,\n",
    "    LlamaTokenizer,\n",
    "    pipeline,\n",
    ")\n",
    "\n",
    "def load_model(model_name_or_path, device, num_gpus, hf_auth_token=None, debug=False):\n",
    "    \"\"\"Load an HF locally saved checkpoint.\"\"\"\n",
    "    if device == \"cpu\":\n",
    "        kwargs = {}\n",
    "    elif device == \"cuda\":\n",
    "        kwargs = {\"torch_dtype\": torch.float16}\n",
    "        if num_gpus == \"auto\":\n",
    "            kwargs[\"device_map\"] = \"auto\"\n",
    "        else:\n",
    "            num_gpus = int(num_gpus)\n",
    "            if num_gpus != 1:\n",
    "                kwargs.update(\n",
    "                    {\n",
    "                        \"device_map\": \"auto\",\n",
    "                        \"max_memory\": {i: \"20GiB\" for i in range(num_gpus)},\n",
    "                    }\n",
    "                )\n",
    "    elif device == \"mps\":\n",
    "        kwargs = {\"torch_dtype\": torch.float16}\n",
    "        # Avoid bugs in mps backend by not using in-place operations.\n",
    "        print(\"mps not supported\")\n",
    "    else:\n",
    "        raise ValueError(f\"Invalid device: {device}\")\n",
    "\n",
    "    if hf_auth_token is None:\n",
    "        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            model_name_or_path, low_cpu_mem_usage=True, **kwargs\n",
    "        )\n",
    "    else:\n",
    "        tokenizer = AutoTokenizer.from_pretrained(\n",
    "            model_name_or_path, use_auth_token=hf_auth_token, use_fast=False\n",
    "        )\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            model_name_or_path,\n",
    "            low_cpu_mem_usage=True,\n",
    "            use_auth_token=hf_auth_token,\n",
    "            **kwargs,\n",
    "        )\n",
    "\n",
    "    if device == \"cuda\" and num_gpus == 1:\n",
    "        model.to(device)\n",
    "\n",
    "    if debug:\n",
    "        print(model)\n",
    "\n",
    "    return model, tokenizer\n",
    "\n",
    "model_name=\"meta-llama/Llama-2-13b-chat-hf\"\n",
    "device = \"cuda\"\n",
    "num_gpus = 2  ## minimal requirement is that you have 2x NVIDIA GPUs\n",
    "\n",
    "## Remember to supply your own huggingface access token\n",
    "hf_token= \"[FILL_IN]\"\n",
    "model, tokenizer = load_model(model_name, device, num_gpus,hf_auth_token=hf_token, debug=False)\n",
    "\n",
    "pipe = pipeline(\n",
    "    \"text-generation\",\n",
    "    model=model,\n",
    "    tokenizer=tokenizer,\n",
    "    max_new_tokens=256,\n",
    "    temperature=0.1,\n",
    "    do_sample=True,\n",
    ")\n",
    "hf_llm = HuggingFacePipeline(pipeline=pipe)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c79ac8d3",
   "metadata": {},
   "source": [
    "### Step 7 - Supply the hf_llm as well as the retriver we prepared above into langchain's RetrievalQA chain\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6123aa0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create the using RetrievalQA\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "qa_chain = RetrievalQA.from_chain_type(llm=hf_llm, # supply meta llama2 model\n",
    "                                  chain_type=\"stuff\",\n",
    "                                  retriever=retriever, # using our own retriever\n",
    "                                  return_source_documents=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "880f74ac",
   "metadata": {},
   "source": [
    "### Step 8 - We are now ready to ask questions "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51b1179d",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"When is the film Titanic being made ?\"\n",
    "#query =\"Who is the director for the film?\"\n",
    "llm_response = qa_chain(query)\n",
    "print(\"llm response after retrieve from KB, the answer is :\\n\")\n",
    "print(llm_response['result'])\n",
    "print(\"---\"*10)\n",
    "print(\"source paragraph >> \")\n",
    "llm_response['source_documents'][0].page_content\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06e002f5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
